13 research outputs found
Memory-Efficient Recursive Evaluation of 3-Center Gaussian Integrals
To improve the efficiency of Gaussian integral evaluation on modern
accelerated architectures, FLOP-efficient Obara-Saika-based recursive evaluation
schemes are optimized for the memory footprint. For the 3-center 2-particle
integrals that are key for the evaluation of Coulomb and other 2-particle
interactions in the density-fitting approximation, the use of multi-quantal
recurrences (in which multiple quanta are created or transferred at once) is
shown to produce significant memory savings. Other innovations include
leveraging register memory for reduced memory footprint and direct compile-time
generation of optimized kernels (instead of custom code generation) with
compile-time features of modern C++/CUDA. High efficiency of the CPU- and
CUDA-based implementation of the proposed schemes is demonstrated for both the
individual batches of integrals involving up to Gaussians with low and high
angular momenta (up to ) and contraction degrees, as well as for the
density-fitting-based evaluation of the Coulomb potential. The computer
implementation is available in the open-source LibintX library. Comment: 37 pages, 2 figures, 6 tables
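For orientation, the role of the 3-center integrals in the density-fitted Coulomb build mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the standard density-fitting formula J_mn = sum_P (mn|P) c_P, where c solves V c = gamma and gamma_P = sum_ls (P|ls) D_ls. It is not LibintX code: the function names `solve` and `df_coulomb`, the nested-list storage, and the toy numbers are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def df_coulomb(b3, v, d):
    """Density-fitted Coulomb matrix (hypothetical helper).

    b3[P][mu][nu] holds the 3-center integrals (P|mu nu),
    v[P][Q] holds the 2-center integrals (P|Q),
    d[mu][nu] is the density matrix.
    """
    naux, nbf = len(b3), len(d)
    # gamma_P = sum_{ls} (P|ls) D_ls
    gamma = [sum(b3[p][l][s] * d[l][s] for l in range(nbf) for s in range(nbf))
             for p in range(naux)]
    # fitting coefficients: V c = gamma
    c = solve(v, gamma)
    # J_mn = sum_P (mn|P) c_P
    return [[sum(b3[p][mu][nu] * c[p] for p in range(naux)) for nu in range(nbf)]
            for mu in range(nbf)]


# Toy data (made up, not real integrals): 2 basis functions, 2 auxiliary functions.
b3 = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
v = [[2.0, 0.0], [0.0, 4.0]]
d = [[1.0, 0.5], [0.5, 1.0]]
j = df_coulomb(b3, v, d)
```

In a production code the 3-center integrals would be generated in batches and contracted on the fly rather than stored as a full tensor; the sketch only shows the order of the contractions.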
Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units
An implementation is presented of an uncontracted Rys quadrature algorithm for electron repulsion integrals, including up to g functions on graphical processing units (GPUs). The general GPU programming model, the challenges associated with implementing the Rys quadrature on these highly parallel emerging architectures, and a new approach to implementing the quadrature are outlined. The performance of the implementation is evaluated for single and double precision on two different types of GPU devices. The performance obtained is on par with the matrix−vector routine from the CUDA basic linear algebra subroutines (CUBLAS) library.
New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock
In this article, a new multithreaded Hartree–Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code.
Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory
With the growing reliance of modern supercomputers on accelerator-based
architectures such as GPUs, the development and optimization of electronic
structure methods to exploit these massively parallel resources has become a
recent priority. While significant strides have been made in the development of
GPU accelerated, distributed memory algorithms for many-body (e.g.
coupled-cluster) and spectral single-body (e.g. planewave, real-space and
finite-element density functional theory [DFT]), the vast majority of
GPU-accelerated Gaussian atomic orbital methods have focused on shared memory
systems with only a handful of examples pursuing massive parallelism on
distributed memory GPU architectures. In the present work, we present a set of
distributed memory algorithms for the evaluation of the Coulomb and
exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via
direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods,
respectively. The absolute performance and strong scalability of the developed
methods are demonstrated on systems ranging from a few hundred to over one
thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter
supercomputer. Comment: 45 pages, 9 figures
Modernizing the core quantum chemistry algorithms
This document covers the basics of computational chemistry and how, using modern programming techniques, the theory can be efficiently implemented on digital computers.
The computer implementations are developed from the core two-electron integrals to many-body and coupled cluster algorithms. Particular attention is paid to the physical constraints of the computer resources and the emergence of novel architectures.
High-performance evaluation of high angular momentum 4-center Gaussian integrals on modern accelerated processors
We present a high-performance evaluation method for 4-center 2-particle
integrals over Gaussian atomic orbitals with high angular momenta ()
and arbitrary contraction degrees on graphical processing units (GPUs) and
other accelerators. The implementation uses the matrix form of
McMurchie-Davidson recurrences. Evaluation of the 4-center integrals over four
() Gaussian AOs in the double precision (FP64) on an NVIDIA V100 GPU
outperforms the reference implementation of the Obara-Saika recurrences () running on a single Intel Xeon core by more than a factor of 1000,
healthily exceeding the 73:1 ratio of the respective hardware peak FLOP rates
while reaching almost 50% of the V100 peak. The approach can be extended to
support AOs with even higher angular momenta; for low angular momenta
alternative approaches will be needed to achieve optimal performance. The
implementation is part of an open-source library freely
available at
New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock
In this article, a new multithreaded Hartree–Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code. Reprinted (adapted) with permission from Journal of Chemical Theory and Computation 8 (2012): 4166, doi:10.1021/ct300526w. Copyright 2012 American Chemical Society.
Fast and Flexible Coupled Cluster Implementation
A new coupled cluster singles and doubles with triples correction, CCSD(T), algorithm is presented. The new algorithm is implemented in object-oriented C++, has a low memory footprint, fast execution time, low I/O overhead, and a flexible storage backend with the ability to use either distributed memory or a file system for storage. The algorithm is demonstrated to work well on single workstations, a small cluster, and a high-end Cray computer. With the new implementation, a CCSD(T) calculation with several hundred basis functions and a few dozen occupied orbitals can run in under a day on a single workstation. The algorithm has also been implemented for graphical processing unit (GPU) architecture, giving a modest improvement. Benchmarks are provided for both CPU and GPU hardware. Reprinted (adapted) with permission from Journal of Chemical Theory and Computation 9 (2013): 3385, doi:10.1021/ct400054m. Copyright 2013 American Chemical Society.